superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He
also reported that the rigorous Smith-Waterman algorithm
worked slightly better than FASTA, which was in turn more
effective than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a
predetermined score but had no penalty for methods which
also reported large numbers of spurious matches. The
Henikoffs searched the SWISS-PROT database (12) and used
PROSITE (13) to define homologous families. Their results
showed that the BLOSUM62 matrix (14) performed markedly
better than the extrapolated PAM-series matrices (15), which
previously had been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence
comparison, the correct results were effectively unknown.
This is because the superfamilies in PIR and PROSITE are
principally created by using the same sequence comparison
methods which are being evaluated. Interdependency of data
and methods creates a "chicken and egg" problem and means,
for example, that new methods would be penalized for
correctly identifying homologs missed by older programs. For
instance, immunoglobulin variable and constant domains are
clearly homologous, but PIR places them in different
superfamilies. The problem is widespread: each superfamily
in PIR 48.00 with a structural homolog is itself homologous
to an average of 1.6 other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and
Schneider (17) used protein structures to evaluate sequence
comparison. Rather than comparing different sequence
comparison algorithms, their work focused on determining a
length-dependent threshold of percentage identity, above
which all proteins would be of similar structure. A result
of this analysis was the HSSP equation; it states that
proteins with 25% identity over 80 residues will have
similar structures, whereas shorter alignments require
higher identity. (Other studies also have used structures
(18-20), but these focused on a small number of model
proteins and were principally oriented toward evaluating
alignment accuracy rather than homology detection.)
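
The length-dependent threshold described above can be sketched in a few lines. This is only an illustration: the text here gives the 80-residue, 25% anchor point but not the curve itself, so the functional form below (290.15 * L^-0.562 up to 80 residues, flat at 24.8% beyond) is an assumption taken from the commonly quoted form of the HSSP curve, and the function names are hypothetical.

```python
def hssp_threshold(length: int) -> float:
    """Percent-identity cutoff for an alignment of `length` residues.

    ASSUMPTION: uses the commonly quoted curve 290.15 * L**-0.562 for
    alignments of 10 to 80 residues and a flat 24.8% above 80 residues.
    These constants are not stated in the surrounding text.
    """
    if length < 10:
        raise ValueError("curve is not defined for alignments under 10 residues")
    if length > 80:
        return 24.8
    return 290.15 * length ** -0.562


def likely_similar_structure(percent_identity: float, length: int) -> bool:
    """True when the alignment lies on or above the assumed threshold curve."""
    return percent_identity >= hssp_threshold(length)
```

Under these assumptions a 25% identical alignment passes at 100 residues but not at 40, which matches the statement that shorter alignments require higher identity.
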

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on
the extreme value distribution (21). Extreme value scoring
was implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical
approaches have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of
recognizing significantly similar proteins (24, 25), the
mathematical tractability of statistical scores "is a
crucial feature of the BLAST algorithm" (1). The validity of
this scoring procedure has been tested analytically and
empirically (see ref. 2 and references in ref. 24). However,
all large empirical tests used random sequences that may
lack the subtle structure found within biological sequences
(26, 27) and obviously do not contain any real homologs.
Thus, although many researchers have suggested that
statistical scores be used to rank matches (24, 25, 28),
there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.
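
As a rough illustration of analytic extreme-value scoring, the sketch below converts a raw alignment score S into an E-value using the Karlin-Altschul form E = K m n exp(-lambda S), and then into a P-value, P = 1 - exp(-E). The function names are hypothetical, and the default `lam` and `k` are illustrative placeholders rather than values taken from this paper; real programs estimate them for the scoring system in use and apply length corrections that are omitted here.

```python
import math


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float = 0.318, k: float = 0.134) -> float:
    """Expected number of chance matches scoring at least `raw_score`.

    ASSUMPTION: `lam` and `k` are placeholder parameters for illustration.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e: float) -> float:
    """Probability of at least one chance match, given E-value `e`."""
    return -math.expm1(-e)  # numerically stable 1 - exp(-e)
```

Ranking matches by E-value rather than by raw score accounts for query and database size; whether that ranking is actually superior on real sequences is the empirical question raised above.
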

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin
are very similar though their sequences are not (29), it has
been apparent that comparing structures is a more powerful
(if less convenient) way to recognize distant evolutionary
relationships than comparing sequences. If two proteins show
a high degree of similarity in their structural details and
function, it is very probable that they have an evolutionary
relationship though their sequence similarity may be low.

The recent growth of protein structure information, combined
with the comprehensive evolutionary classification in the
SCOP database (4, 5), has allowed us to overcome previous
limitations. With these data, we can evaluate the
performance of sequence comparison methods on real protein
sequences whose relationships are known confidently. The
SCOP database uses structural information to recognize
distant homologs, the large majority of which can be
determined unambiguously. These superfamilies, such as the
globins or the immunoglobulins, would be recognized as
related by the vast majority of the biological community
despite the lack of high sequence similarity.

From SCOP, we extracted the sequences of domains of proteins
in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had
those <40% identical. The databases were created by first
sorting all protein domains in SCOP by their quality and
making a list. The highest quality domain was selected for
inclusion in the database and removed from the list. Also
removed from the list (and discarded) were all other domains
above the threshold level of identity to the selected
domain. This process was repeated until the list was empty.
The PDB40D-B database contains 1,323 domains, which have
9,044 ordered pairs of distant relationships, or ~0.5% of
the total 1,749,006 ordered pairs. In PDB90D-B, the 2,079
domains have 53,988 relationships, representing 1.2% of all
pairs. Low complexity regions of sequence can achieve
spurious high scores, so these were masked in both databases
by processing with the SEG program (27) using recommended
parameters: 12 1.8 10. The databases used in this paper are
available from http://sss.stanford.edu/sss/, and databases
derived from the current version of SCOP may be found at
http://scop.mrc-lmb.cam.ac.uk/scop/.
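
The selection procedure described above is a greedy filter, and a minimal sketch of it follows. The names `select_nonredundant`, `quality`, and `percent_identity` are hypothetical; how domain quality and pairwise identity were actually computed is not specified here, so both are passed in as caller-supplied functions.

```python
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def select_nonredundant(
    domains: Sequence[D],
    quality: Callable[[D], float],
    percent_identity: Callable[[D, D], float],
    threshold: float,
) -> List[D]:
    """Greedily keep the best remaining domain and discard its near-duplicates."""
    # Sort once so the highest quality domain is always at the front.
    remaining = sorted(domains, key=quality, reverse=True)
    selected: List[D] = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        # Discard every domain at or above the identity threshold to `best`.
        remaining = [d for d in remaining
                     if percent_identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)
```

Calling the function with a threshold of 40 or 90 corresponds to the two databases. `ordered_pairs(n)` counts the n(n-1) ordered query-target pairs in an all-against-all search, which is the denominator behind the pair totals quoted above.
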

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces
the heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted
otherwise, the distant homolog results here are from
PDB40D-B. Although the precise numbers reported here are
specific to the structural domain databases used, we expect
the trends to be general.

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence
from the database was used as a query to search the
database. This yielded ordered pairs of query and target
sequences with associated scores, which were sorted, on the
basis of their scores, from best to worst. The ideal method
would have

of errors. Second, SSEARCH, WU-BLAST2, and FASTA ktup = 1
perform best, though BLAST and FASTA ktup = 2 detect most of
the relationships found by the best procedures and are
appropriate for rapid initial searches.

Thus, if the procedures assessed here fail to find a
reliable match, it does not imply that the sequence is
unique; rather, it indicates that any relatives it might
have are distant ones.